@maleksan85 commented Oct 16, 2025

GPT OSS; GEMM m and n shapes to check: ROCm@bcc4e69

```shell
HIP_VISIBLE_DEVICES=7 \
HSA_NO_SCRATCH_RECLAIM=1 \
NCCL_MIN_NCHANNELS=112 \
USE_FASTSAFETENSOR=1 \
SAFETENSORS_FAST_GPU=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
vllm serve /data/models/openai/gpt-oss-120b \
    --host localhost \
    --port 30000 \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 2048 \
    --swap-space 16 \
    --block-size 64 \
    --async-scheduling \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'
```
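As a side note (a sketch, not part of the PR): the inlined `--compilation-config` JSON is easy to break with shell quoting, so building it in Python and dumping it with `json.dumps` is one way to generate the flag value safely.

```python
import json

# Same compilation config as passed to `vllm serve` above; constructing it
# as a dict and serializing avoids hand-editing nested quotes in the shell.
compilation_config = {
    "pass_config": {
        "enable_attn_fusion": True,
        "enable_noop": True,
        "enable_fusion": True,
    },
    "cudagraph_mode": "FULL",
    "custom_ops": ["+rms_norm", "+silu_and_mul", "+quant_fp8"],
    "splitting_ops": [],
}
# Paste the printed string into --compilation-config='...'
print(json.dumps(compilation_config))
```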
```shell
vllm bench serve \
  --host localhost \
  --port 30000 \
  --model /data/models/openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --random-prefix-len 0 \
  --request-rate "inf" \
  --max-concurrency 64 \
  --num-prompts 640 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el
```
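A quick sanity check (a sketch, not part of the PR) that the token totals and throughput reported below follow from this benchmark command:

```python
# Inputs taken straight from the bench command and the first run's report.
num_prompts = 640
input_len = output_len = 1024      # --random-input-len / --random-output-len
duration_s = 132.96                # reported "Benchmark duration (s)"

total_input = num_prompts * input_len       # matches "Total input tokens: 655360"
total_output = num_prompts * output_len     # --ignore-eos forces full output length
output_tps = total_output / duration_s      # ~4929 tok/s, matching the report
print(total_input, total_output, round(output_tps, 2))
```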
```shell
HIP_VISIBLE_DEVICES=7 \
HSA_NO_SCRATCH_RECLAIM=1 \
NCCL_MIN_NCHANNELS=112 \
USE_FASTSAFETENSOR=1 \
SAFETENSORS_FAST_GPU=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
python /data/vllm-scripts/llm_test.py \
    --model /data/models/openai/gpt-oss-120b \
    --dataset-path /data/models/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --batch-size 4 \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 2048 \
    --swap-space 16 \
    --block-size 64 \
    --async-scheduling \
    --no-enable-prefix-caching \
    --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'
```

With the change:

```text
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  132.96
Total input tokens:                      655360
Total generated tokens:                  655360
Request throughput (req/s):              4.81
Output token throughput (tok/s):         4929.13
Peak output token throughput (tok/s):    5696.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          9858.26
---------------Time to First Token----------------
Mean TTFT (ms):                          400.59
Median TTFT (ms):                        352.67
P99 TTFT (ms):                           1099.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.60
Median TPOT (ms):                        12.57
P99 TPOT (ms):                           13.53
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.60
Median ITL (ms):                         11.89
P99 ITL (ms):                            13.90
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13290.38
Median E2EL (ms):                        13212.19
P99 E2EL (ms):                           14837.24
==================================================
```

Without the change:

```text
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  134.69
Total input tokens:                      655360
Total generated tokens:                  655360
Request throughput (req/s):              4.75
Output token throughput (tok/s):         4865.53
Peak output token throughput (tok/s):    5632.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          9731.05
---------------Time to First Token----------------
Mean TTFT (ms):                          396.52
Median TTFT (ms):                        387.08
P99 TTFT (ms):                           966.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.77
Median TPOT (ms):                        12.76
P99 TPOT (ms):                           13.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.77
Median ITL (ms):                         12.09
P99 ITL (ms):                            14.52
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13464.16
Median E2EL (ms):                        13493.87
P99 E2EL (ms):                           14258.33
==================================================
```
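The end-to-end delta between the two runs above can be summarized with a quick calculation (a sketch over the reported figures, not part of the PR):

```python
# Output throughput and mean TPOT taken from the two result tables above.
with_tps, without_tps = 4929.13, 4865.53    # tok/s, with vs. without change
with_tpot, without_tpot = 12.60, 12.77      # ms per output token

tps_gain = with_tps / without_tps - 1       # ~+1.3% output throughput
tpot_drop = 1 - with_tpot / without_tpot    # ~-1.3% mean TPOT
print(f"+{tps_gain:.2%} tok/s, -{tpot_drop:.2%} TPOT")
```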
With Triton `gemm_a16w16`:

```text
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                   _gemm_a16_w16_kernel         0.00%       0.000us         0.00%       0.000us       0.000us        2.492s        12.20%        2.492s      15.687us        158877
```

Without Triton GEMM (hipBLASLt kernels):

```text
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT32x64...         0.00%       0.000us         0.00%       0.000us       0.000us        1.415s         6.82%        1.415s      26.839us         52704
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT16x64...         0.00%       0.000us         0.00%       0.000us       0.000us     829.718ms         4.00%     829.718ms      15.754us         52668
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT16x16...         0.00%       0.000us         0.00%       0.000us       0.000us     746.112ms         3.60%     746.112ms      14.147us         52740
```

~1.2x speedup over hipBLASLt (about 2.99 s vs. 2.49 s total GEMM CUDA time).
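The 1.2x figure follows directly from the profiler tables above (a sketch of the arithmetic, not part of the PR):

```python
# Total Self CUDA time of the three Cijk_* hipBLASLt kernels vs. the single
# Triton gemm_a16w16 kernel; call counts are comparable (~158k each).
hipblaslt_s = 1.415 + 0.829718 + 0.746112   # ~2.99 s total
triton_s = 2.492
speedup = hipblaslt_s / triton_s
print(f"{speedup:.2f}x")
```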

```shell
lm_eval --model vllm --model_args pretrained=/data/models/openai/gpt-oss-120b,tensor_parallel_size=1,max_gen_toks=2048 --tasks gsm8k --batch_size auto --num_fewshot 5 --limit 250 --apply_chat_template
```

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@mergify bot added the rocm (Related to AMD ROCm) label Oct 16, 2025
Aleksandr Malyshev added 4 commits October 20, 2025 21:54
@gshtras added the ready label Oct 28, 2025
Labels: frontend, gpt-oss, ready, rocm, tool-calling